ABSTRACT

This paper examines two methods for detecting long-range correlations in DNA: a fractal measure and
a mutual information technique. We evaluate the performance and implications of these methods in detail. In
particular, we explore their use in comparing DNA sequences from a variety of sources. Using software for performing
in silico mutations, we also consider evolutionary events that lead to long-range correlations and analyse these
correlations using the techniques presented. Comparisons are made between these virtual sequences, randomly
generated sequences, and real sequences. We also explore correlations in chromosomes from different species.

1. INTRODUCTION 

DNA is a structure containing a long sequence of complementary pairing bases, denoted by the symbol set
{a,t,c,g} [1]. The genetic material in DNA undergoes a variety of different mutational events [2,3]. These 
mutational events can be considered as string rewriting rules [4] that lead to correlations in DNA. Repeated use 
of short sequences as promoters [5], or as intron markers [6] can give rise to very long-range correlations. 

A number of different techniques have been studied for examining long-range correlations in DNA. These
include Levy walks [7], Fourier transforms [8-10], and wavelets [11]. Several authors have explored such
correlations by looking for power-law relationships in the power spectra of DNA sequences; this approach purports
to show both long-range correlations and differences between regions of DNA. In this paper we examine long-range
correlations with mutual information techniques [12], and briefly explore the Higuchi fractal method [13].

DNA sequences contain a number of coding regions. These are regions that code for protein and are delimited
by start and stop codons (although the presence of these codons does not necessarily indicate a coding region).
Coding regions may contain introns, which are spliced (cut) out of the RNA transcript before the protein is
made according to the code on the RNA template (which in turn comes from the DNA). Non-coding regions may
just be junk, or may code for regulatory RNAs [14], such as the Xist gene, which switches off the extra X
chromosome in women [15].

In this paper we show that these long-range correlations exist for real sequences of DNA and virtual sequences 
of DNA, but not random sequences of DNA. The virtual sequences of DNA are those produced by our software, 
which simulates a variety of mutational events. The random DNA has a random sequence generated in software, 
so it should contain almost no correlations. We also explore whether or not the power spectra show any differences 
between coding and non-coding DNA, and between different species of bacteria. 

2. SEQUENCES EXAMINED 

For exploring correlations at very large distances, we used Homo sapiens chromosome 20 [16], Mus musculus 
chromosome 2 [17,18] and Escherichia coli [19].

2.1. Real sequences 

In order to compare correlations in real DNA with those in short random and short virtual DNA sequences, we 
chose a selection of twenty short, real gene sequences from various organisms. Their accession numbers and
descriptions are shown in Table 1.

Table 1. These are the GenBank [20] accession numbers and descriptions of the twenty short, real mRNA sequences
used.



No.  Accession   Description
1    NM_076575   C. elegans essential Drosophila hunchback like.
2    NM_169234   Drosophila melanogaster hunchback CG9786-PB (hb).
3    BC016664    Homo sapiens cone-rod homeobox.
4    BC016502    Mus musculus cone-rod homeobox containing gene.
5    NM_031888   Homo sapiens pro-melanin-concentrating hormone-like 2.
6    NM_010410   Rattus norvegicus β-catenin.
7    AY438620    Arabidopsis thaliana GLUR3 (At1g05200) mRNA.
8    AY148346    Mus musculus sentrin-specific protease.
9    BC062048    Rattus norvegicus MAP kinase-activated protein kinase 2.
10   BC002377    Homo sapiens PTK7 protein tyrosine kinase 7.
11   NM_001437   Homo sapiens estrogen receptor 2 (ESR2).
12   BC057647    Mus musculus visual system homeobox 1 homolog.
13   BC060890    Danio rerio retinal homeobox gene 1.
14   BC004108    Homo sapiens immunoglobulin superfamily, member 8.
15   BC048387    Mus musculus immunoglobulin superfamily, member 8.
16   NM_033615   Mus musculus ADAM33.
17   NM_025220   Homo sapiens ADAM33, transcript variant 1.
18   BC062067    Rattus norvegicus SRY-box containing gene 10.
19   BC002824    Homo sapiens SRY-box containing gene 10.
20   XM_128139   Mus musculus SRY-box containing gene 10.



2.2. Random sequences 

To compare the mutual information in real and virtual sequences, we generated twenty random sequences of 
length 10 000 bases, where all four bases have equal probability of appearing in each position. 
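
As an illustration of how such benchmark sequences can be produced, the following minimal Python sketch draws each base independently and uniformly. The function name `random_dna` and the fixed seed are assumptions made for this example; they are not taken from the software used in this paper.

```python
import random

BASES = "atcg"

def random_dna(length, rng):
    """Return `length` bases, each drawn independently with probability 1/4."""
    return "".join(rng.choice(BASES) for _ in range(length))

# Fixed seed chosen only so the example is reproducible (an assumption).
rng = random.Random(0)
sequences = [random_dna(10_000, rng) for _ in range(20)]
```

Because every position is drawn independently, any correlation measured in these sequences is a finite-size fluctuation rather than real structure.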

2.3. Virtual sequences 

The twenty virtual non-coding regions are generated by the latest version of our software for exploring mutations 
in DNA [21]. It implements the following in silico operations: 

• Base substitutions, where one base pair has been replaced with a different base through some mechanism 
(such as UV irradiation with an absent or partly unsuccessful repair process). 

• Additions, where a base pair has been added to the sequence. 

• Deletions, where a base pair has been removed from the sequence. 

• Flips, where part of a sequence has been replaced by its reverse complement. 

• Fills, where a sequence of repetitive elements (of length 1 to 4) has been inserted up to 50 times. The exact 
number of repetitions is chosen at random from a uniform distribution, as is the length. 

• Copies, where part of a sequence (up to 100 bases in length) has been copied. As with the fill operations, 
the length is chosen from a uniform random distribution. 

The flip, fill, and copy operations are illustrated in Fig. 1. These operations are meant to simulate small-scale
general mutations, and larger-scale ones of the type that occur in non-coding DNA. In each run of the simulator
we took one of the random DNA sequences and applied each of the above mechanisms up to 30 times, with the exact
number chosen from a uniform random distribution, to generate long-range correlations in the DNA sequences. With
some experimentation we found that, as one would expect, the fill and copy mechanisms are the primary drivers
in creating long-range correlations.
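
The flip, fill, and copy operations can be sketched as simple string rewrites. The Python below is an illustrative sketch only, not the software of [21]: the function names, the choice of insertion point, and the reading of "up to 50 times" as a uniform draw from 1 to 50 are all assumptions.

```python
import random

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def flip(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    segment = seq[start:end]
    reverse_complement = "".join(COMPLEMENT[b] for b in reversed(segment))
    return seq[:start] + reverse_complement + seq[end:]

def fill(seq, pos, unit, repeats):
    """Insert `repeats` copies of the short repeat `unit` at index `pos`."""
    return seq[:pos] + unit * repeats + seq[pos:]

def copy_segment(seq, start, end, dest):
    """Insert a copy of seq[start:end] at index `dest`."""
    return seq[:dest] + seq[start:end] + seq[dest:]

def random_fill(seq, rng):
    """Fill with a random unit of length 1-4, repeated 1-50 times (assumed ranges)."""
    unit = "".join(rng.choice("atcg") for _ in range(rng.randint(1, 4)))
    return fill(seq, rng.randrange(len(seq) + 1), unit, rng.randint(1, 50))

print(flip("atggccgattatt", 0, 13))  # aataatcggccat, the flip shown in Fig. 1
```

Fill and copy both insert material that repeats elsewhere in the sequence, which is why they would be expected to dominate the creation of correlations.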



Fill:  ATTG ATTG ATTG ATTG

Copy:  ATGGCCGATTATT  ->  ATGGCCGATTATT ATGGCCGATTATT

Flip:  ATGGCCGATTATT  ->  AATAATCGGCCAT

Fig. 1. This figure shows the three operations: fill, where we have added a sequence of repetitive elements of length 4 in
this case; copy, where we have copied part of a DNA sequence; and flip, where we have replaced part of the DNA sequence
by its reverse complement.



3. METHODS FOR EXPLORING CORRELATIONS IN DNA 
3.1. Mutual information functions 

One method for showing the existence of long-range correlations in DNA is to use the mutual information
function, an approach that has been shown to distinguish between coding and non-coding regions [22]. We explore
the use of the mutual information function given in Eq. 1,

$$M(d) = \sum_{\alpha,\beta \in A} P_{\alpha\beta}(d) \log \frac{P_{\alpha\beta}(d)}{P_\alpha P_\beta} \qquad (1)$$

for symbols $\alpha, \beta \in A$ (in the case of DNA, $A = \{a, t, c, g\}$). $P_{\alpha\beta}(d)$ is the probability that
symbols $\alpha$ and $\beta$ are found a distance $d$ apart, and $P_\alpha$ is the probability of finding symbol $\alpha$.
This is related to the correlation function in Eq. 2 [12]:

$$\Gamma(d) = \sum_{\alpha,\beta \in A} a_\alpha a_\beta P_{\alpha\beta}(d) - \Big( \sum_{\alpha \in A} a_\alpha P_\alpha \Big)^2 \qquad (2)$$

where $a_\alpha$ and $a_\beta$ are numerical representations of symbols $\alpha$ and $\beta$. As discussed by Li [12], the
fact that we are working with a finite sequence means that this $M(d)$ overestimates the true $M_T(d)$ by

$$M(d) - M_T(d) \approx \frac{K(K-2)}{2N} \qquad (3)$$

where K is the number of symbols (for DNA this is always 4) and N is the sequence length. The shortest
sequence used was the sequence of the Homo sapiens immunoglobulin superfamily, member 8 gene (GenBank
accession BC004108), which was approximately 1750 base pairs in length. Thus for this gene the difference between
the estimated and real mutual information is approximately (4 × 2)/(2 × 1750) ≈ 0.002, which is about a factor
of ten less than the mutual information estimate for this gene. Furthermore, since in our results below we compare
the mutual information of the sequence with that of the randomized sequence, we are effectively eliminating this
inaccuracy.
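
A minimal Python sketch of the mutual information estimate and of the finite-size overestimate is given below. It assumes natural logarithms, pair probabilities estimated from the N - d available pairs, and single-symbol probabilities estimated from the whole sequence; these choices and the function names are assumptions for illustration, not a description of the code used for the results in this paper.

```python
import math
from collections import Counter

def mutual_information(seq, d):
    """Estimate M(d) from the frequencies of symbol pairs at separation d."""
    n_pairs = len(seq) - d
    if d < 1 or n_pairs < 1:
        raise ValueError("need 1 <= d < len(seq)")
    single = Counter(seq)
    pairs = Counter(zip(seq, seq[d:]))  # (seq[i], seq[i + d]) for every valid i
    total = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n_pairs
        p_a = single[a] / len(seq)
        p_b = single[b] / len(seq)
        total += p_ab * math.log(p_ab / (p_a * p_b))  # natural log (assumed)
    return total

def finite_size_bias(k_symbols, n):
    """Approximate overestimate K(K - 2) / (2N) for a sequence of length N."""
    return k_symbols * (k_symbols - 2) / (2 * n)

print(round(finite_size_bias(4, 1750), 4))  # 0.0023
```

For a sequence with period 4, such as "atcg" repeated, `mutual_information(seq, 4)` returns log 4, the largest value possible with four equiprobable symbols, since each base fully determines the base four positions later.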



The mutual information is (at least for large d) proportional to the square of the correlation function,
$\Gamma(d)$ [12]. Even for small d, the mutual information function still provides an estimate of the correlations.
The range of d we used (up to 1024) means we obtain a reasonable estimate of the correlations at these larger
distances. In biological terms, we are capturing correlations within regions of genes, and between promoter regions
and DNA. This range is not sufficiently large to explore longer-range correlations such as those between genes
(typically tens of thousands of bases) or those that might exist between activator or silencer regions and promoters,
again on the order of tens of thousands of bases [5]. In the whole-chromosome analysis we find repeating elements
and other correlations in junk DNA in addition to correlations within genes.

3.2. Higuchi fractal measure 

A method for determining correlations in sequences is to use the Higuchi fractal method [13]. In using this
method we compute

$$L(k) = \sum_{m=0}^{k-1} \frac{N-1}{\lfloor (N-m)/k \rfloor \, k^{3}} \sum_{i=1}^{\lfloor (N-m)/k \rfloor} \big| x(m+ik) - x(m+(i-1)k) \big| \qquad (4)$$

for k = 1, ..., 1024 over non-overlapping subsequences of length 4000. The sequence x(i) is generated by mapping
the sequence of bases, s(i):



$$x(i) = \begin{cases} 1.0, & s(i) = a, \\ 0.5, & s(i) = t, \\ -0.5, & s(i) = c, \\ -1.0, & s(i) = g. \end{cases} \qquad (5)$$

Performing linear regression on log L(k) versus log k then gives a slope of -D, where D is the estimate of the
true fractal measure. For a high degree of correlation, we expect a value of D closer to one.

One can also apply the Higuchi method to the density of bases in blocks, as carried out by Lu et al. [23]. However,
this does not provide a measure of correlations in the sequence as the authors claim, but rather correlations in
the density function. In the fashion we use it, we are detecting correlations in the sequence, though as with the
mutual information function we only explore correlations up to 1024 base pairs.
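
The procedure can be sketched in Python as follows. This sketch uses the standard normalization of Higuchi's method [13] with zero-based indexing, so that every index stays inside the sequence; it is an assumption-laden illustration and is not claimed to reproduce the numerical values reported in this paper. The function names are hypothetical.

```python
import math

BASE_VALUE = {"a": 1.0, "t": 0.5, "c": -0.5, "g": -1.0}  # mapping of bases to numbers

def curve_length(x, k):
    """Mean normalized curve length of x at lag k (requires k < len(x))."""
    n = len(x)
    lengths = []
    for m in range(k):
        steps = (n - 1 - m) // k  # increments that stay inside x (0-based)
        if steps == 0:
            continue
        path = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                   for i in range(1, steps + 1))
        lengths.append(path * (n - 1) / (steps * k) / k)
    return sum(lengths) / len(lengths)

def higuchi_dimension(x, k_values):
    """Return D, minus the least-squares slope of log L(k) against log k."""
    log_k = [math.log(k) for k in k_values]
    log_l = [math.log(curve_length(x, k)) for k in k_values]  # needs L(k) > 0
    mean_k = sum(log_k) / len(log_k)
    mean_l = sum(log_l) / len(log_l)
    slope = (sum((u - mean_k) * (v - mean_l) for u, v in zip(log_k, log_l))
             / sum((u - mean_k) ** 2 for u in log_k))
    return -slope

x = [BASE_VALUE[b] for b in "atggccgattatt"]  # numeric sequence for a short example
```

As a sanity check, a straight line such as `list(range(1000))` gives D = 1 with this sketch, the value expected for a smooth curve.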



4. RESULTS 

4.1. Short DNA sequences 

To analyze the short DNA sequences (real, virtual, and random) using the mutual information function (Eq. 1), we
compared the mutual information plot with the average ± standard deviation plot of the mutual information
function for 100 randomized sequences with the same base distribution but in random order (thus eliminating
correlations). Examples of this are shown in Fig. 2.

We determined the maximum distance at which significant correlations were present, up to the maximum
distance studied of 1024. The results of this for the 20 real, virtual, and random sequences are shown in
Table 2. As one would expect, no long-range correlations are present in our benchmark random sequences. Correlations
at distances of up to d > 1024 are present in our virtual sequences, and correlations of distance d > 1024 can
also be found in real sequences. Because the mutation process used to generate the virtual
sequences was random, there was a significant variation in the length of correlations present. This corresponded
well to the number of repeated elements and copy mutations, in particular the copy mutations. Future
work will attempt to quantify the mutual information values with a directed model of evolution in which we take
real sequences and apply mutation operators in a realistic fashion. For example, point mutations are much more
likely to be seen in the "wobble" positions of codons than elsewhere, and these in turn are much more likely than
insertions and deletions.
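
The comparison against randomized sequences can be sketched as below. The significance rule used here (the mutual information of the original sequence exceeding the mean plus one standard deviation of the shuffled copies) is an assumption made for illustration, as are the function names; with such a loose threshold, occasional false positives are to be expected for uncorrelated sequences.

```python
import math
import random
import statistics
from collections import Counter

def mutual_information(seq, d):
    """Estimate M(d) from pair frequencies at separation d (natural log)."""
    n_pairs = len(seq) - d
    single = Counter(seq)
    pairs = Counter(zip(seq, seq[d:]))
    return sum((c / n_pairs) * math.log((c / n_pairs)
               / ((single[a] / len(seq)) * (single[b] / len(seq))))
               for (a, b), c in pairs.items())

def surrogate_band(seq, distances, n_shuffles, rng):
    """Mean and standard deviation of M(d) over shuffled copies of seq."""
    values = {d: [] for d in distances}
    bases = list(seq)
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # same base distribution, order randomized
        shuffled = "".join(bases)
        for d in distances:
            values[d].append(mutual_information(shuffled, d))
    return {d: (statistics.mean(v), statistics.stdev(v)) for d, v in values.items()}

def max_significant_distance(seq, distances, n_shuffles=100, rng=None):
    """Largest d at which M(d) exceeds mean + 1 std of the surrogates (assumed rule)."""
    rng = rng or random.Random(0)
    band = surrogate_band(seq, distances, n_shuffles, rng)
    significant = [d for d in distances
                   if mutual_information(seq, d) > band[d][0] + band[d][1]]
    return max(significant, default=0)
```

For a strongly periodic input such as "attg" repeated 200 times, every multiple of 4 among the tested distances is flagged, because shuffling destroys the periodicity while preserving the base counts.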

The results of using the Higuchi fractal method are shown in Table 3. Note that these estimates are relatively
independent of the choice of mapping of bases onto numbers (several different mappings were tried, with variations
on the order of 0.001), and the numbers are in fact overestimates of the true fractal dimension. The fractal
dimensions appear unrelated to the mutual information distances, thus illustrating the fact that the mutual
information function is a better characterization of the distances at which correlations are present.

(a) This figure shows the plot of the mutual information function M(d) in Eq. 1 against base distance d for the
sequence of the MAP kinase-activated protein kinase 2 gene from Mus musculus, shown in a darker line style,
compared with the set of 100 randomized sequences of the same base distribution, the lighter band. The graph of
mutual information in the MAP kinase gene mostly sits above the "noise floor" of the randomized sequences, in
which the correlations have been destroyed.

Fig. 2. These figures show the plots of the mutual information function M(d) in Eq. 1 against base distance d for (a) a real
sequence and (b) a virtual sequence. At larger distances, there are fewer symbols at that distance available for
computing the mutual information, so the overestimate increases in value, producing a slight slope to the graphs.

4.2. Whole chromosome sequences 

The results of analyzing chromosomes from E. coli, M. musculus, and H. sapiens using both the Higuchi fractal
measure, D, and the mutual information function, M(d), indicate the presence of correlations up to the maximum
length explored (1024). This is shown in Table 4. There is less variation in these measures for E. coli, which
has a greater proportion of gene-coding DNA relative to other sequences. These gene-coding regions allow less room
for repeating elements due to evolutionary and size constraints, and thus have a lower correlation distance.

5. CONCLUSIONS 

We found long-range correlations present in short sequences of real DNA, "virtual" DNA, and throughout whole 
chromosomes. Our simulation of genetic mutation events in "junk" DNA with fill, copy, and mutate operations
also produces long-range correlations approaching 1024 bases in length. Our negative test, with computer
generated random sequences, succeeds in that we do not find any significant long-range correlations. These 
results confirm that mutational events in non-conserved regions of DNA can give rise to long-range correlations. 

(b) This figure shows the plot of the mutual information function M(d) in Eq. 1 against base distance d for the
virtual DNA sequence number 14, shown in a darker line style, compared with the set of 100 randomized sequences
of the same base distribution, shown as a lighter band. The graphs mostly overlap, indicating few significant
correlations in the virtual sequence when compared with the randomized sequences containing little to no
correlations.



Table 2. This table shows the approximate (±50) distances at which the mutual information function drops down to
the level of the uncorrelated sequences of the same base distribution. The numbering of the real sequences matches the
order in which they are given in Table 1. The numbering of the virtual sequences corresponds to the random sequence which
was mutated to produce that virtual sequence, but bears no relationship to the numbering of the real sequences.

Sequence number   Random   Virtual   Real
1                 -        -         > 1024
2                 -        -         > 1024
3                 -        > 1024    700
4                 -        100       800
5                 -        50        -
6                 -        -         > 1024
7                 -        850       > 1024
8                 -        -         > 1024
9                 -        -         > 1024
10                -        800       > 1024
11                -        -         > 1024
12                -        > 1024    > 1024
13                -        100       950
14                -        > 1024    600
15                -        > 1024    > 1024
16                -        -         > 1024
17                -        -         > 1024
18                -        -         > 1024
19                -        > 1024    > 1024
20                -        -         > 1024

Table 3. Higuchi fractal dimension estimates D for the random, virtual, and real sequences.

Sequence number   Random   Virtual   Real
1                 1.104    1.103     1.098
2                 1.103    1.094     1.095
3                 1.104    1.094     1.118
4                 [...]    1.086     1.110
5                 1.103    1.086     1.092
6                 1.102    1.094     1.103
7                 1.105    1.100     1.105
8                 [...]    1.102     1.087
9                 1.102    1.093     1.080
10                1.103    1.099     1.099
11                1.103    1.089     1.087
12                1.104    1.103     1.098
13                1.103    1.099     1.098
14                1.104    1.091     1.055
15                1.104    1.100     1.101
16                1.103    1.103     1.090
17                1.102    1.102     1.097
18                1.102    1.091     1.094
19                1.102    1.099     1.099
20                1.103    1.099     1.091


Table 4. This table shows the average Higuchi fractal dimension D over blocks of length 4000 in the chromosomes listed,
along with the variance, and the distance d at which correlations exist as determined by the mutual information function in
Eq. 1.

Sequence                                mean(D)   var(D)         d
Escherichia coli K12, complete genome   1.10039   2.07 x 10~ b   > 1024
Mus musculus chromosome 2               1.09691   7.59 x 10~ b   > 1024
Homo sapiens chromosome 20              1.089     0.00991        > 1024